In this project, I will analyze the white wine dataset, which includes wine quality ratings and the chemical properties of each wine. The objective of the analysis is to determine which properties influence overall wine quality.
In this section, a histogram, boxplot, and summary are provided for each wine property. This output should give a high-level overview of the dataset.
This tidy dataset contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
The primary focus of this analysis is quality. Specifically, can we identify which properties have the greatest impact on the quality evaluation? Based on the definitions provided, I believe the analysis will show that the following properties have the most significant impact.
At this point, based strictly on gut instinct, I believe alcohol, pH, chlorides, density, and other properties will impact the quality of the wine. However, I rarely drink wine and have extremely limited knowledge of the subject. In addition, I’ll be watching properties that have a wide range between the 1st and 3rd quartiles.
Yes. Based on the definitions provided in the dataset, the acid levels could largely impact the quality. Therefore, I created an additional property, total.acids.
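The write-up does not show how total.acids was built, so the following is only a minimal sketch under one plausible assumption: that it is the row-wise sum of the three acid columns (fixed.acidity, volatile.acidity, and citric.acid). The `wine_demo` data frame is a toy stand-in with made-up values, not the real dataset.

```r
# Toy stand-in for the wine data frame (values are illustrative only)
wine_demo <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  citric.acid      = c(0.36, 0.34, 0.40)
)

# ASSUMPTION: total.acids is the row-wise sum of the three acid columns;
# the original definition is not shown in the write-up
wine_demo$total.acids <- with(
  wine_demo,
  fixed.acidity + volatile.acidity + citric.acid
)

print(wine_demo)
```

If the real variable was defined differently (for example, without citric acid), only the expression inside `with()` would need to change.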
Most of the properties have long right tails. The alcohol distribution was also interesting. In addition, I was surprised by the number of properties with extreme outliers, such as residual sugar. However, after a little thought, this isn’t surprising: wines range from dry to sweet, and residual sugar has a mean of 6.391 and a max of 65.8 grams. Also, the histogram clearly shows there are no terrible or excellent wines, with all wines scoring between 3 and 9.
This grid of histograms by quality shows the impact of each property and also illustrates that there are few samples of lower and higher quality wines.
Using the output of ggpairs, we can see the relationship between alcohol and quality. So, we’ll plot the impact of each property by alcohol and quality.
It appears higher pH, free sulfur dioxide, and alcohol have a positive impact on quality, while higher chlorides, total sulfur dioxide, density, and volatile acidity have a negative impact on quality. Based on the initial visualizations, it appears alcohol has the most significant impact. Next, we’ll look at the correlation coefficients.
## [,1]
## fixed.acidity -0.113662831
## volatile.acidity -0.194722969
## citric.acid -0.009209091
## residual.sugar -0.097576829
## chlorides -0.209934411
## free.sulfur.dioxide 0.008158067
## total.sulfur.dioxide -0.174737218
## density -0.307123313
## pH 0.099427246
## sulphates 0.053677877
## alcohol 0.435574715
## total.acids -0.136319694
## pH.bucket 0.097255315
## rounded.free.sulfur.dioxide 0.007895089
Reviewing the correlation coefficients confirms the information observed in the boxplots.
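For readers who want to reproduce a one-column table like the one above, here is a minimal sketch of how such output can be generated with `cor()`. The data frame `wine_demo` is simulated (it is not the wine dataset), and the helper name `property_cols` is my own; the only point is the shape of the call.

```r
# Simulated stand-in for the wine data (not the real dataset)
set.seed(1)
n <- 200
wine_demo <- data.frame(
  alcohol = runif(n, 8, 14),
  density = runif(n, 0.987, 1.002),
  pH      = runif(n, 2.8, 3.6)
)
# Simulated rating that rises with alcohol, plus noise
wine_demo$quality <- round(3 + 0.4 * (wine_demo$alcohol - 8) + rnorm(n))

# Pearson correlation of every property column with quality;
# passing a data frame and a vector returns a one-column matrix
property_cols <- setdiff(names(wine_demo), "quality")
quality_cor <- cor(wine_demo[, property_cols], wine_demo$quality)
print(quality_cor)
```

Because `cor()` is given a data frame and a single vector, the result prints with one row per property and an unnamed `[,1]` column, matching the layout of the table above.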
This visualization clearly shows the impact of percent alcohol on quality. Selecting a wine with a percent alcohol of 12, 13, or 14 means you’ll likely select a wine with a quality of 7, 8, or 9.
Mean pH level for wines with quality less than or equal to 6
with(subset(wine,quality <= 6),mean(pH))
## [1] 3.180847
Mean pH level for wines with quality greater than or equal to 7
with(subset(wine,quality >= 7),mean(pH))
## [1] 3.215132
It’s somewhat difficult to see in the chart, but wines with a lower quality rating (less than or equal to 6) have a lower mean pH, while wines with a higher quality rating (greater than or equal to 7) have a higher mean pH.
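The two `subset()` calls above can also be collapsed into a single grouped summary. This is just a sketch on a toy data frame (`wine_demo`, with made-up values), and the group labels are my own naming.

```r
# Toy stand-in for the wine data (values are illustrative only)
wine_demo <- data.frame(
  quality = c(5, 6, 6, 7, 8, 5),
  pH      = c(3.10, 3.20, 3.15, 3.25, 3.30, 3.05)
)

# Label each wine by the same split used above: <= 6 versus >= 7
quality_group <- ifelse(wine_demo$quality >= 7, "7 or higher", "6 or lower")

# Mean pH per group in one call
mean_pH <- tapply(wine_demo$pH, quality_group, mean)
print(mean_pH)
```

Putting both means side by side makes it easier to see how small the gap is compared with the overall spread of pH.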
All wines with a quality of 7, 8, or 9 have a chloride level less than or equal to 0.1. Also, lower quality wines seem to have a higher chloride level.
Lower quality wines tend to have a higher total sulfur dioxide level. Of the wines with a quality greater than 5, a proportion of 0.6777164 (about 67.8%) have a total sulfur dioxide less than 150.
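A share like this one can be computed by taking the mean of a logical vector. The sketch below uses a toy data frame (`wine_demo`, made-up values) rather than the real data, so the number it prints is not the figure quoted above.

```r
# Toy stand-in for the wine data (values are illustrative only)
wine_demo <- data.frame(
  quality              = c(4, 5, 6, 6, 7, 8),
  total.sulfur.dioxide = c(180, 170, 120, 160, 110, 140)
)

# Keep only wines rated above 5
above_five <- subset(wine_demo, quality > 5)

# mean() of a logical vector gives the proportion of TRUE values
share_low_so2 <- mean(above_five$total.sulfur.dioxide < 150)

# Multiply by 100 to report it as a percentage
print(round(100 * share_low_so2, 1))
```

Note that `mean()` returns a proportion between 0 and 1, so it needs to be multiplied by 100 before a percent sign is attached.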
While the boxplot showed a relationship between quality and density, it’s hard to glean any conclusion from this chart. However, it appears wines with a density between 0.98 and 0.99 score higher.
Based on this chart, it appears high volatile acidity has a negative impact on quality.
Interesting but nothing stands out…
Next, we’ll look at the impact of alcohol content on other wine properties.
## [,1]
## fixed.acidity -0.12088112
## volatile.acidity 0.06771794
## citric.acid -0.07572873
## residual.sugar -0.45063122
## chlorides -0.36018871
## free.sulfur.dioxide -0.25010394
## total.sulfur.dioxide -0.44889210
## density -0.78013762
## pH 0.12143210
## sulphates -0.01743277
## total.acids -0.11229714
## rounded.free.sulfur.dioxide -0.24688985
This set of charts illustrates the impact of the fermentation process and the relationship it has with all the other property values. In addition, it provides visual confirmation of the correlation coefficients listed above. Again, as alcohol content increases, all other property levels decrease except pH and volatile.acidity.
As it relates to quality, the visualizations and correlation coefficients indicate the following:
Density and Residual Sugar, both linked with alcohol, also show a strong relationship.
The negative relationship between alcohol content and density.
The strongest relationship that impacted the quality of the wine was alcohol at 0.436. However, the strongest relationship overall was Residual Sugar and Density: 0.839.
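To find the strongest pairwise relationship without scanning the whole matrix by eye, one option is to blank out the diagonal and search the remaining correlations. This is a sketch on simulated data (`wine_demo` is not the wine dataset, and the simulated link between sugar and density is my own construction for illustration).

```r
# Simulated stand-in: density is built to track residual sugar
set.seed(42)
n <- 300
sugar <- runif(n, 1, 20)
wine_demo <- data.frame(
  residual.sugar = sugar,
  density        = 0.99 + 0.0004 * sugar + rnorm(n, sd = 0.001),
  pH             = runif(n, 2.8, 3.6)
)

# Full pairwise Pearson correlation matrix
cor_matrix <- cor(wine_demo)

# Hide the diagonal (always 1) and the duplicated upper triangle
cor_matrix[upper.tri(cor_matrix, diag = TRUE)] <- NA

# Locate the entry with the largest absolute correlation
strongest <- which(
  abs(cor_matrix) == max(abs(cor_matrix), na.rm = TRUE),
  arr.ind = TRUE
)
pair <- c(
  rownames(cor_matrix)[strongest[1, "row"]],
  colnames(cor_matrix)[strongest[1, "col"]]
)
print(pair)
```

Using the absolute value means a strong negative relationship, such as alcohol and density, would be picked up just as readily as a strong positive one.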
With some basic data analysis complete, the data shows a strong relationship between alcohol content and quality. Next, we’ll review the impacts of three positive properties (pH, sulphates, and free sulfur dioxide) and three negative properties (density, chlorides, and volatile acidity) when compared to quality and alcohol.
Lower quality wines have lower alcohol and lower pH levels.
Sulphates seem to have little impact on quality.
There are 180 wines with a quality rating greater than 7. Of those wines, 158 have a percent alcohol greater than 10 and 135 have a free sulfur dioxide greater than 20 mg.
Lower quality wines have higher density.
Lower quality wines have higher chloride levels.
Lower quality wines have lower alcohol content and higher volatile acids.
Based on the analysis, it appears to be much easier to produce a poor quality wine. I was surprised that high chloride levels impact the wine quality. In addition, there are 180 wines with a quality rating greater than 7. Of those wines, 158 have a percent alcohol greater than 10 and 135 have a free sulfur dioxide greater than 20 mg. It’s interesting to see that free sulfur dioxide levels have a positive impact, while high total sulfur dioxide levels have a negative impact.
Clearly, the fermentation process has an impact on alcohol content, which in turn impacts other properties. It would be interesting to sample wines that have a higher percent alcohol to determine when the alcohol content starts to have a negative impact on quality.
Yes… but, I don’t fully understand the output.
Throughout the analysis of the white wine samples, alcohol content has been the strongest factor in wine quality. Now, let’s review the negative correlation coefficients that impact wine quality. By summing the properties with a negative correlation coefficient, it becomes extremely clear how these values impact the quality of the wine. As a general observation, lower quality wines have an alcohol content of 12% or less with higher levels of total acidity, chlorides, density, and total sulfur dioxide. On the flip side, higher quality wines have lower levels of total acidity, chlorides, density, and total sulfur dioxide with an alcohol content above 12%. *Note: this chart includes wines of quality 4 and above.
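The write-up does not say exactly how the negative properties were summed, so the following is a sketch of one possible approach, not necessarily the one behind the chart. It assumes each column is standardized first (z-scores via `scale()`), since a raw sum would be dominated by total sulfur dioxide, whose values are orders of magnitude larger than chlorides or density. The column name `negative.score` and the toy values are hypothetical.

```r
# Toy stand-in for the wine data (values are illustrative only)
wine_demo <- data.frame(
  chlorides            = c(0.03, 0.05, 0.04, 0.08),
  density              = c(0.990, 0.995, 0.992, 0.999),
  total.sulfur.dioxide = c(100, 150, 120, 200)
)

negative_cols <- c("chlorides", "density", "total.sulfur.dioxide")

# ASSUMPTION: z-score each column first so no single unit dominates the sum
z_scores <- scale(wine_demo[, negative_cols])

# One combined score per wine: higher means more of the negative properties
wine_demo$negative.score <- rowSums(z_scores)
print(wine_demo)
```

A combined score like this can then be plotted against alcohol and colored by quality in the same way as the individual properties.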
Now, let’s review the two positive correlation coefficients that impact wine quality. By creating a ratio of free sulfur dioxide / pH, we can see that most higher quality wines have a free sulfur dioxide / pH ratio between 5 and 18 and an alcohol content above 11%. *Note: this chart includes wines of quality 4 and above.
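The ratio itself is a one-line derived column. Here is a minimal sketch on a toy data frame; the column name `free.so2.per.pH` is hypothetical (the original variable name is not shown), and the values are made up.

```r
# Toy stand-in for the wine data (values are illustrative only)
wine_demo <- data.frame(
  free.sulfur.dioxide = c(30, 45, 12),
  pH                  = c(3.0, 3.2, 3.0)
)

# Ratio of free sulfur dioxide to pH (hypothetical column name)
wine_demo$free.so2.per.pH <- wine_demo$free.sulfur.dioxide / wine_demo$pH

# Flag wines that fall in the 5 to 18 band described above
in_band <- wine_demo$free.so2.per.pH >= 5 & wine_demo$free.so2.per.pH <= 18
print(cbind(wine_demo, in_band))
```

The `in_band` flag is only there to show how the 5 to 18 range could be checked per wine once the ratio exists.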
As you can see from this chart and previous charts, when alcohol content is high, chlorides, sulphates, total acids, and residual sugar are low. This typically results in a higher quality wine. This chart provides a layperson like myself with a simple method for selecting a quality wine. Selecting a wine with an alcohol content of eleven or higher will most likely give the consumer an average to better than average wine tasting experience.
I started analyzing the dataset by looking at several univariate plots, including a simple histogram and boxplot. With the exception of alcohol, density, and residual sugar, most wine properties had more outliers than I expected. The boxplot findings led me to revise the histograms by adding a density line and a vertical mean line. These made it easier to see that most properties have long right tails and are right-skewed.
After obtaining a general understanding of the dataset from the univariate plots, I developed several bivariate plots. First came the histograms by quality. These gave the reader something visually interesting and showed that there are few high and low quality wine samples, but overall I found the graphs difficult to read. Next, the boxplots started providing insight into the relationship between wine properties and quality. The boxplots clearly showed the impact alcohol has on quality. In addition, they helped identify the negative impacts of higher chlorides, total sulfur dioxide, density, and volatile acidity on quality. After reviewing the boxplot findings, I developed several additional visualizations to confirm my findings, including Percent of Wines by Quality for each property with a positive or negative correlation coefficient.
Last, I created several additional multivariate plots. Using the data points provided, I confirmed that alcohol has the largest impact on quality. In addition, it appears that negative properties like density, chlorides, volatile acidity, and total sulfur dioxide have an impact on quality.
This dataset provides nearly 5,000 observations of white wines sampled and rated by wine experts. The sample contains eleven different properties for white wine plus the quality rating. While on the surface this seemed like ample data for an accurate analysis, I found myself wanting more information. Primarily, an equal number of samples at each percent alcohol level would help to prove that alcohol is the most significant factor in wine quality. In addition, other properties like location, grape variety, and the person sampling the wine would help to provide a more complete assessment of the wines.
As a person who doesn’t drink wine on a regular basis, this analysis has led me to believe that selecting a bottle of wine with a high alcohol content will increase my chances of choosing a higher quality wine.
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
http://stackoverflow.com/questions/24776200/ggplot-replace-count-with-percentage-in-geom-bar
http://docs.ggplot2.org/current/
http://cran.r-project.org/web/packages/RColorBrewer/index.html
http://www.realsimple.com/holidays-entertaining/entertaining/food-drink/alcohol-content-wine
http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/
http://galahad.well.ox.ac.uk/repro/
# I don't fully understand this code
# but it produced the output I wanted
# http://galahad.well.ox.ac.uk/repro/
wineSummary <- apply(wine[, 1:11], 2,
function(x) tapply(x,wine$quality, summary))
wineSummary <- lapply(wineSummary, do.call, what = rbind)
wineSummary
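library(ggplot2)

# Scatter plot of yv against xv, with points colored by the column named
# in color_var and a 50% t-distribution ellipse per quality level.
# row and column are accepted but not used.
mvp <- function(yv, xv, color_var, row, column) {
  ggplot(aes_string(y = yv, x = xv),
         data = wine) +
    geom_point(aes_string(color = color_var)) +
    stat_ellipse(aes(color = quality.as.factor),
                 linetype = 1, type = "t", level = 0.50) +
    # A plot keeps only one coordinate system, so the axes are set once here:
    # zoom from the minimum to the 99th percentile to hide extreme outliers
    coord_cartesian(ylim = c(min(wine[[yv]]),
                             quantile(wine[[yv]], 0.99)),
                    xlim = c(min(wine[[xv]]),
                             quantile(wine[[xv]], 0.99))) +
    ggtitle(paste(yv, " vs ", xv, "\n by Quality")) +
    scale_color_brewer(palette = "RdYlGn",
                       guide = guide_legend(reverse = TRUE))
}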
wine$bounded.sulfur.dioxide <- wine$total.sulfur.dioxide -
wine$free.sulfur.dioxide
names <- c(
"alcohol"
,"fixed.acidity"
,"volatile.acidity"
,"citric.acid"
,"residual.sugar"
,"chlorides"
,"free.sulfur.dioxide"
,"bounded.sulfur.dioxide"
,"total.sulfur.dioxide"
,"density"
,"pH"
,"sulphates"
,"total.acids"
)
# Plot every ordered pair of distinct properties
for (y in names) {
  for (x in names) {
    if (y != x)
      print(mvp(y, x, "quality.as.factor"))
  }
}
Comments
Our white wine dataset has quality ratings from 3 to 9. If we break that into three groups (bad (3-4), average (5-7), and good (8-9)), we find that the majority of the wines in this dataset fall between 5 and 7, which is the range for average wines.
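The three-way grouping described above maps directly onto `cut()`. This is a small sketch on a made-up vector of ratings, not the real quality column.

```r
# Made-up quality ratings for illustration
quality <- c(3, 4, 5, 6, 7, 8, 9, 6, 6, 5)

# Intervals are right-closed by default: (2,4] bad, (4,7] average, (7,9] good
quality_group <- cut(
  quality,
  breaks = c(2, 4, 7, 9),
  labels = c("bad", "average", "good")
)

# Count how many wines land in each group
group_counts <- table(quality_group)
print(group_counts)
```

Running `table()` on the grouped factor is a quick way to confirm how heavily the ratings concentrate in the average band.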
I found the boxplot helpful for identifying outliers. The boxplot also provides a visual for the IQR. Between the histogram and boxplot, my guess is that the properties with a small or narrow IQR will impact wine quality.
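The IQR and the usual outlier rule behind boxplot whiskers can be computed directly. The sketch below applies the common 1.5 × IQR rule to a made-up vector; note that `boxplot()` itself uses hinges that can differ slightly from `quantile()` on small samples, so treat this as an approximation of what the plot shows.

```r
# Made-up measurements with one obviously extreme value
x <- c(1.2, 1.5, 1.6, 1.8, 2.0, 2.1, 2.3, 9.5)

# First and third quartiles and the interquartile range
q   <- unname(quantile(x, c(0.25, 0.75)))
iqr <- IQR(x)

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```

The same steps can be applied column by column to compare how wide each property's IQR is relative to its outliers.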
It’s a little surprising to see that the only property without outliers is alcohol. Is the alcohol content of wine regulated?